Explore and Summarize Data/Red Wine Quality by Ayah AlHamdan

Introduction

In this project, I use R to perform exploratory data analysis on the Red Wine Quality dataset and find out which properties influence the quality of red wines.

Data Overview

First, we’ll have a look at the data.

## [1] 1599   13
## Observations: 1,599
## Variables: 13
## $ X                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...

There are 1599 observations and 13 variables. The X variable is an index for each observation in the dataset, while the other 12 variables are 11 chemical properties and the quality rating of the red wine.
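The overview above is the kind of output that `dim()` and a structure summary print. A minimal sketch, assuming a data frame named `red_wine` (the name that appears in later output) and using only the first five rows of four of the columns listed above instead of the full file:

```r
# Illustrative subset: first five rows of four columns from the output above.
red_wine <- data.frame(
  X             = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality       = c(5L, 5L, 5L, 6L, 5L)
)

dim(red_wine)  # number of rows, then number of columns
str(red_wine)  # column types and leading values

# X only numbers the rows, so it can be set aside before analysis.
wine_vars <- red_wine[, setdiff(names(red_wine), "X")]
names(wine_vars)
```

Dropping X here is a choice made for the sketch, not a step the report itself describes.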

Univariate Plots Section

Let’s look at the distributions of the variables:

## Using  as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the histograms above look roughly normal. Residual sugar and chlorides are the clearest exceptions: both are strongly right skewed. Outliers can cause that skewness, and the boxplots below show that these two variables have many of them.

To visualize the variability of the variables, we can use a boxplot for each one:

From the visualizations above, we can see the minimum and maximum values of each variable along with the median and outliers.

Let’s look closer into them:

Residual Sugar:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

For residual sugar, there are many outliers. Most of the data falls between 0.9 and about 4; samples with more than that are outliers.
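Boxplots conventionally flag points that lie more than 1.5 times the interquartile range beyond the quartiles; this is the default in R's `boxplot()` and in ggplot2's `geom_boxplot()`. A small sketch, assuming the default was used for the plots here, that derives those fences from the quartiles reported in the summary above:

```r
# Quartiles of residual sugar, copied from the summary above.
q1 <- 1.9
q3 <- 2.6

iqr <- q3 - q1                  # interquartile range
lower_fence <- q1 - 1.5 * iqr   # points below this are flagged
upper_fence <- q3 + 1.5 * iqr   # points above this are flagged

c(lower = lower_fence, upper = upper_fence)
```

The fences come out at about 0.85 and 3.65. The reported minimum of 0.9 sits above the lower fence, so under this rule every residual sugar outlier is on the high side.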

Alcohol:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

In alcohol content, there are not many outliers. The middle half of the samples contains between 9.5 and 11.1 alcohol, and about a quarter of the samples have more than 11.1.

Sulphates:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The mean value of sulphates is 0.66. Values higher than 1 are considered outliers.

Fixed Acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity has a mean of 8.32, and the middle half of the samples has values between 7.1 and 9.2. Samples with values higher than about 12 are outliers.

Volatile Acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity is roughly normally distributed. The mean value is 0.53 and there are only a few outliers.

Citric Acid:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The values of citric acid are between 0 and 0.75, apart from a single outlier with a value of 1. The mean is 0.27.

Chlorides:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

If we ignore the outliers, the chlorides values would be roughly normally distributed. The middle half of the values is tightly packed between 0.07 and 0.09, with a mean of 0.087, although the full range runs from 0.012 to 0.611.

Free Sulfur Dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The free sulfur dioxide distribution appears to be right skewed. The middle half of the values is between 7 and 21. Values higher than about 40 are outliers.

Total Sulfur Dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Total sulfur dioxide is also right skewed. The mean value is 46, and there are extreme outliers with values greater than 100.

Density:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The density is normally distributed with a mean value of 0.997.

pH:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH distribution is also roughly normal. The first and third quartiles are 3.2 and 3.4 respectively, with a mean of 3.3. Values far outside that range (roughly below 2.9 or above 3.7 by the 1.5 × IQR rule) are outliers.
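One rough way to check the skewness remarks is to compare the mean with the median: a mean well above the median suggests a long right tail. This is a heuristic, not a formal test. The sketch below uses only the figures reported in the summaries above; the `rel_gap` column is a name introduced here for illustration.

```r
# Median and mean for six variables, copied from the summaries above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH"),
  median   = c(2.2, 0.079, 14, 38, 0.9968, 3.31),
  mean     = c(2.539, 0.08747, 15.87, 46.47, 0.9967, 3.311)
)

# Gap between mean and median, relative to the median.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest positive gap first.
stats[order(stats$rel_gap, decreasing = TRUE), ]
```

Total sulfur dioxide shows the largest gap, while density and pH are close to zero, which agrees with the descriptions above.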

There are strong correlations between some of the variables in the dataset:

From this scatter plot, it is clear that there is a strong negative correlation between pH and fixed acidity.

There is a strong positive correlation between fixed acidity and density.
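A correlation like the one between pH and fixed acidity can be measured with `cor()`. The sketch below uses only the first seven values of each column as printed in the data overview, so the resulting number illustrates the call and will not match the coefficient for all 1599 samples.

```r
# First seven values of each column, as printed in the data overview.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9)
pH            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30)

# Pearson correlation measures linear association.
r_pearson <- cor(fixed_acidity, pH, method = "pearson")

# Spearman correlation uses ranks, so it is less sensitive to outliers.
r_spearman <- cor(fixed_acidity, pH, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Even on this tiny sample the Pearson coefficient is negative, in the same direction as the relationship seen in the scatter plot.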

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

There’s a positive correlation between quality and sulphates. Since most samples have a quality rating of 5 or 6, the mean value of sulphates is around 0.66.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1599 observations of red wine samples and 13 variables: an index (X), 11 variables that describe the chemical properties of each sample, and its quality rating.

What is/are the main feature(s) of interest in your dataset?

The aim of this data exploration and analysis is to see what could affect the quality of wine. So the main feature of interest in this dataset is the quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

All the chemical properties will help support the investigation. Any of them could plausibly have an effect on the quality.

Did you create any new variables from existing variables in the dataset?

No, no new variables were created.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

There were no unusual distributions beyond the right skew noted for residual sugar, chlorides and the sulfur dioxide variables. The dataset is tidy and complete, so no change or adjustment was needed.

Bivariate Plots Section

We can visualize the relationship between the variables and their correlations in the matrix below:

Since we’re interested in the quality of the wine samples, we will focus on the variables with the highest correlation coefficients with quality. The top two are alcohol and sulphates, with coefficients of 0.48 and 0.25 respectively.

Let’s look into how quality is affected by alcohol:
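Ranking variables by their correlation with quality can be done by taking one row of the correlation matrix and sorting it. The sketch below is illustrative only: it uses the first six rows of five columns as printed in the data overview, so its coefficients will differ from the 0.48 and 0.25 quoted above, and `sample_rows` is a name introduced here.

```r
# First six rows of five columns, as printed in the data overview.
sample_rows <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(sample_rows)

# Keep the quality row and drop its correlation with itself.
quality_cor <- cor_matrix["quality", colnames(cor_matrix) != "quality"]

# Strongest relationships first, regardless of sign.
quality_cor[order(abs(quality_cor), decreasing = TRUE)]
```

Sorting by absolute value keeps strong negative relationships, such as the one with volatile acidity discussed later, from being pushed to the bottom of the list.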

## red_wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## red_wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## red_wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## red_wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## red_wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## red_wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

It appears that the higher the alcohol content, the better the quality rating. Mean alcohol rises steadily from 9.9 at quality 5 to 12.09 at quality 8, although ratings 3 and 4 (means of 9.955 and 10.27) do not follow the trend.
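The layout of the output above, with a `red_wine$quality:` header per group, is what base R's `by()` prints. A minimal sketch of that kind of grouped summary, assuming this is how it was produced and using only the first eight rows printed in the data overview:

```r
# First eight rows of alcohol and quality, as printed in the data overview.
red_wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L)
)

# Five-number summary plus mean of alcohol within each quality rating.
by(red_wine$alcohol, red_wine$quality, summary)

# Just the group means, as a named vector that is easy to plot or compare.
mean_alcohol <- tapply(red_wine$alcohol, red_wine$quality, mean)
mean_alcohol
```

`tapply()` is convenient when only one statistic per group is needed, for example to compare the mean alcohol of adjacent quality ratings.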

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

From the previous observations, the strongest relationship with quality is the amount of alcohol in the sample. Other features, such as sulphates and citric acid, are also positively related to quality, though less strongly.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I assumed that the higher the alcohol content of a sample, the more sugar it would contain. Interestingly, in this dataset of red wines that is not the case. There is no clear relationship between the alcohol content and the amount of sugar.

## round(red_wine$alcohol): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   1.700   1.800   1.833   1.950   2.100 
## -------------------------------------------------------- 
## round(red_wine$alcohol): 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.30    1.90    2.10    2.72    2.60   15.50 
## -------------------------------------------------------- 
## round(red_wine$alcohol): 10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.100   2.406   2.600  13.900 
## -------------------------------------------------------- 
## round(red_wine$alcohol): 11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   2.000   2.200   2.507   2.600   9.000 
## -------------------------------------------------------- 
## round(red_wine$alcohol): 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.728   2.800  12.900 
## -------------------------------------------------------- 
## round(red_wine$alcohol): 13
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   2.100   2.400   2.739   3.050   6.400 
## -------------------------------------------------------- 
## round(red_wine$alcohol): 14
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   1.800   1.800   2.131   2.200   4.300 
## -------------------------------------------------------- 
## round(red_wine$alcohol): 15
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.5     7.5     7.5     7.5     7.5     7.5

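We can observe from the summary above that mean residual sugar stays between about 2.4 and 2.7 when alcohol rounds to a value from 9 to 13, and is lower (2.13) when alcohol rounds to 14. The group at 15 has the same minimum and maximum (7.5), so it carries little weight.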
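When a continuous variable is binned with `round()`, it helps to report how many samples fall in each bin next to the statistic, because a bin with very few samples can look like a trend. A sketch using the first eight rows printed in the data overview; the names `alcohol_bin`, `n_per_bin` and `mean_sugar` are introduced here for illustration:

```r
# First eight rows of alcohol and residual sugar, as printed in the data overview.
alcohol        <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0)
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2)

# Bin alcohol to the nearest whole number, as in the summary above.
alcohol_bin <- round(alcohol)

# Sample count and mean residual sugar for each bin.
n_per_bin  <- table(alcohol_bin)
mean_sugar <- tapply(residual_sugar, alcohol_bin, mean)

data.frame(
  bin        = names(mean_sugar),
  n          = as.vector(n_per_bin),
  mean_sugar = as.vector(mean_sugar)
)
```

In the full summary above, a bin whose minimum, median and maximum are all identical is what a bin with a single sample (or identical samples) looks like, so its mean should not be read as a pattern.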

What was the strongest relationship you found?

The relationship between alcohol content and quality rating.

Multivariate Plots Section

Let’s see how alcohol, volatile acidity and citric acid are related to red wine quality:

It appears that high alcohol and citric acid, combined with low volatile acidity, go with a high quality rating in red wines.

Now let’s look closer at how volatile acidity relates to quality:

The best quality wine samples are those that have high alcohol content and low volatile acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The samples that have high alcohol content along with high citric acid are the best quality wines.

Were there any interesting or surprising interactions between features?

High-quality red wines contain lower residual sugar levels.


Final Plots and Summary

Plot One

Description One

In this dataset of red wine samples, the lowest quality rating is 3 and the highest is 8. The quality is roughly normally distributed: most samples have ratings of 5 or 6, and only a few are rated lower or higher.
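Because quality takes only a handful of integer values, its distribution can be tabulated directly. The sketch below uses the first 14 quality ratings printed in the data overview, so the counts describe that small sample, not the full dataset.

```r
# First 14 quality ratings, as printed in the data overview.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5)

# Number of samples per rating.
counts <- table(quality)
counts

# Share of samples per rating, rounded to two decimals.
round(prop.table(counts), 2)

# Lowest and highest rating present in the sample.
range(quality)
```

A bar chart of `counts` would be the plotted equivalent of this table.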

Plot Two

Description Two

The graph above shows that the high-quality red wines (green) appear on the left side, where the volatile acidity is low. We can see that quality ratings 5, 7 and 8 have densities higher than 3, whereas the rest fall between 1 and 2.5.

Plot Three

Description Three

As shown in the graph above, the higher-quality samples (darker spots) appear in the upper right part of the graph, where both citric acid and alcohol content are higher.


Reflection

The most challenging part in this project was that I had no background knowledge about wines and their quality since I come from a country where we don’t drink alcoholic beverages. I chose this dataset to enrich my knowledge and learn more about how wines are considered high quality. Since this topic is new to me, I found everything in this exploratory data analysis to be very interesting.

In this dataset of red wine samples, it appears that the best quality wines contain high alcohol, citric acid and sulphates (positive correlations) and low volatile acidity (negative correlation). This dataset has no samples that are rated below 3 or above 8. A larger sample that covers the whole range of quality ratings would further improve the analysis. Prediction models could be built to predict the quality of wine and test these trends.